Abstract

We present a technique for automatic induction 
of slot annotations for subcategorization frames, 
based on induction of hidden classes in the EM 
framework of statistical estimation. The models 
are empirically evaluated by a general decision
test. Induction of slot labeling for subcatego- 
rization frames is accomplished by a further ap- 
plication of EM, and applied experimentally on 
frame observations derived from parsing large 
corpora. We outline an interpretation of the 
learned representations as theoretical-linguistic 
decompositional lexical entries. 

1 Introduction 

An important challenge in computational lin- 
guistics concerns the construction of large-scale 
computational lexicons for the numerous natural
languages where very large samples of language
use are now available. Resnik (1993) initiated
research into the automatic acquisition of
semantic selectional restrictions. Ribas (1994)
presented an approach which takes into account 
the syntactic position of the elements whose 
semantic relation is to be acquired. However, 
those and most of the following approaches re- 
quire as a prerequisite a fixed taxonomy of se- 
mantic relations. This is a problem because (i) 
entailment hierarchies are presently available 
for few languages, and (ii) we regard it as an 
open question whether and to what degree ex- 
isting designs for lexical hierarchies are appro- 
priate for representing lexical meaning. Both 
of these considerations suggest the relevance of 
inductive and experimental approaches to the 
construction of lexicons with semantic informa- 
tion. 

This paper presents a method for automatic
induction of semantically annotated
subcategorization frames from unannotated corpora.
We use a statistical subcat-induction system which
estimates probability distributions and corpus
frequencies for pairs of a head and a subcat
frame (Carroll and Rooth, 1998). The statistical
parser can also collect frequencies for the
nominal fillers of slots in a subcat frame. The
induction of labels for slots in a frame is based
upon estimation of a probability distribution over
tuples consisting of a class label, a selecting
head, a grammatical relation, and a filler head.
The class label is treated as hidden data in the
EM-framework for statistical estimation.

2 EM-Based Clustering 

In our clustering approach, classes are derived 
directly from distributional data — a sample of 
pairs of verbs and nouns, gathered by pars- 
ing an unannotated corpus and extracting the 
fillers of grammatical relations. Semantic classes 
corresponding to such pairs are viewed as hid- 
den variables or unobserved data in the con- 
text of maximum likelihood estimation from 
incomplete data via the EM algorithm. This 
approach allows us to work in a mathemati- 
cally well-defined framework of statistical infer- 
ence, i.e., standard monotonicity and conver- 
gence results for the EM algorithm extend to 
our method. The two main tasks of EM-based 
clustering are i) the induction of a smooth prob- 
ability model on the data, and ii) the auto- 
matic discovery of class-structure in the data. 
Both of these aspects are respected in our ap- 
plication of lexicon induction. The basic ideas 
of our EM-based clustering approach were
presented in Rooth (Ms). Our approach contrasts
with the merely heuristic and empirical
justification of similarity-based approaches to
clustering (Dagan et al., 1998), for which so far
no clear probabilistic interpretation has been
given. The probability model we use can be found
earlier in Pereira et al. (1993). However, in
contrast to this approach, our statistical
inference method for clustering is formalized
clearly as an EM-algorithm. Approaches to
probabilistic clustering similar to ours were
presented recently in Saul and Pereira (1997) and
Hofmann and Puzicha (1998). There also
EM-algorithms for similar probability models have
been derived, but applied only to simpler tasks
not involving a combination of EM-based clustering
models as in our lexicon induction experiment.
For further applications of our clustering model
see Rooth et al. (1998).

Figure 1: Class 17: scalar change

We seek to derive a joint distribution of
verb-noun pairs from a large sample of pairs of
verbs $v \in V$ and nouns $n \in N$. The key idea
is to view $v$ and $n$ as conditioned on a hidden
class $c \in C$, where the classes are given no
prior interpretation. The semantically smoothed
probability of a pair $(v, n)$ is defined to be:

$$p(v,n) = \sum_{c \in C} p(c,v,n) = \sum_{c \in C} p(c)\,p(v|c)\,p(n|c)$$

The joint distribution $p(c,v,n)$ is defined by
$p(c,v,n) = p(c)\,p(v|c)\,p(n|c)$. Note that by
construction, conditioning of $v$ and $n$ on each
other is solely made through the classes $c$.
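The definition above can be illustrated with a minimal Python sketch. The function name `smoothed_prob`, the dictionary layout, and all parameter values are assumptions made for this illustration; they are not estimates or code from the experiments reported here.

```python
def smoothed_prob(v, n, p_c, p_v_given_c, p_n_given_c):
    """Return p(v, n) = sum over c of p(c) * p(v|c) * p(n|c)."""
    return sum(
        pc * pv.get(v, 0.0) * pn.get(n, 0.0)
        for pc, pv, pn in zip(p_c, p_v_given_c, p_n_given_c)
    )


# Toy parameters for two classes (invented, for illustration only).
p_c = [0.6, 0.4]
p_v_given_c = [
    {"say.aso:s": 0.7, "rise.as:s": 0.3},
    {"say.aso:s": 0.1, "rise.as:s": 0.9},
]
p_n_given_c = [
    {"man": 0.8, "price": 0.2},
    {"man": 0.1, "price": 0.9},
]

p_rise_price = smoothed_prob("rise.as:s", "price", p_c, p_v_given_c, p_n_given_c)
print(p_rise_price)

# The smoothed probabilities of all (v, n) pairs form a distribution.
total_mass = sum(
    smoothed_prob(v, n, p_c, p_v_given_c, p_n_given_c)
    for v in ("say.aso:s", "rise.as:s")
    for n in ("man", "price")
)
```

Because $v$ and $n$ interact only through $c$, a pair never seen in training still receives positive probability whenever some class gives positive probability to both of its members.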

In the framework of the EM algorithm
(Dempster et al., 1977), we can formalize
clustering as an estimation problem for a latent
class (LC) model as follows. We are given: (i)
a sample space $\mathcal{Y}$ of observed,
incomplete data, corresponding to pairs from
$V \times N$, (ii) a sample space $\mathcal{X}$ of
unobserved, complete data, corresponding to
triples from $C \times V \times N$, (iii) a set
$X(y) = \{x \in \mathcal{X} \mid x = (c,y), c \in C\}$
of complete data related to the observation $y$,
(iv) a complete-data specification $p_\theta(x)$,
corresponding to the joint probability $p(c,v,n)$
over $C \times V \times N$, with parameter-vector
$\theta = \langle \theta_c, \theta_{vc}, \theta_{nc} \mid c \in C, v \in V, n \in N \rangle$,
(v) an incomplete-data specification $p_\theta(y)$
which is related to the complete-data
specification as the marginal probability
$p_\theta(y) = \sum_{X(y)} p_\theta(x)$.

The EM algorithm is directed at finding a
value $\hat{\theta}$ of $\theta$ that maximizes the
incomplete-data log-likelihood function $L$ as a
function of $\theta$ for a given sample
$\mathcal{Y}$, i.e.,
$\hat{\theta} = \arg\max_{\theta} L(\theta)$ where
$L(\theta) = \ln \prod_{y} p_\theta(y)$.

As prescribed by the EM algorithm, the
parameters of $L(\theta)$ are estimated indirectly
by proceeding iteratively in terms of
complete-data estimation for the auxiliary
function $Q(\theta; \theta^{(t)})$, which is the
conditional expectation of the complete-data
log-likelihood $\ln p_\theta(x)$ given the observed
data $y$ and the current fit of the parameter
values $\theta^{(t)}$ (E-step). This auxiliary
function is iteratively maximized as a function
of $\theta$ (M-step), where each iteration is
defined by the map
$\theta^{(t+1)} = M(\theta^{(t)}) = \arg\max_{\theta} Q(\theta; \theta^{(t)})$.

Figure 2: Class 5: communicative action

Note that our application is an instance of the
EM-algorithm for context-free models (Baum et
al., 1970), from which the following particularly
simple reestimation formulae can be derived.
Let $x = (c,y)$, and $f(y)$ the sample-frequency
of $y$. Then
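A sketch of these formulae, assuming they are the standard EM updates for the latent class model defined above. Here $p_\theta(x \mid y)$ denotes the posterior probability of the class in $x = (c,y)$ given the observed pair $y$; this notation is an assumption, chosen to be consistent with the definitions of $X(y)$ and $p_\theta$:

```latex
\begin{align*}
p_\theta(x \mid y) &= \frac{p_\theta(x)}{\sum_{x' \in X(y)} p_\theta(x')} \\[4pt]
M(\theta_{vc}) &= \frac{\sum_{y \in \{v\} \times N} f(y)\, p_\theta(x \mid y)}
                       {\sum_{y} f(y)\, p_\theta(x \mid y)} \\[4pt]
M(\theta_{nc}) &= \frac{\sum_{y \in V \times \{n\}} f(y)\, p_\theta(x \mid y)}
                       {\sum_{y} f(y)\, p_\theta(x \mid y)} \\[4pt]
M(\theta_{c})  &= \frac{\sum_{y} f(y)\, p_\theta(x \mid y)}
                       {\sum_{y} f(y)}
\end{align*}
```

Each update is a ratio of expected counts, so the re-estimated parameters remain properly normalized.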



Intuitively, the conditional expectation of the
number of times a particular $v$, $n$, or $c$
choice is made during the derivation is prorated
by the conditionally expected total number of
times a choice of the same kind is made. As shown
by Baum et al. (1970), these expectations can be
calculated efficiently using dynamic programming
techniques. Every such maximization step
increases the log-likelihood function $L$, and a
sequence of re-estimates eventually converges to a
(local) maximum of $L$.

In the following, we will present some examples
of induced clusters. Input to the clustering
algorithm was a training corpus of 1280715 tokens
(608850 types) of verb-noun pairs participating in
the grammatical relations of intransitive and
transitive verbs and their subject- and
object-fillers. The data were gathered from the
maximal-probability parses the head-lexicalized
probabilistic context-free grammar of (Carroll
and Rooth, 1998) gave for the British National
Corpus (117 million words).
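The E-step and M-step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the function `em_cluster`, the random initialization, and the toy frequency table are all hypothetical, and posteriors are computed by direct enumeration over classes.

```python
import math
import random


def em_cluster(freq, n_classes, n_iter=50, seed=0):
    """Fit p(c), p(v|c), p(n|c) to pair frequencies freq[(v, n)] by EM."""
    rng = random.Random(seed)
    verbs = sorted({v for v, _ in freq})
    nouns = sorted({n for _, n in freq})

    def random_dist(keys):
        # Strictly positive random starting values, normalized to sum to 1.
        weights = [rng.random() + 0.1 for _ in keys]
        z = sum(weights)
        return {k: w / z for k, w in zip(keys, weights)}

    p_c = [1.0 / n_classes] * n_classes
    p_v = [random_dist(verbs) for _ in range(n_classes)]
    p_n = [random_dist(nouns) for _ in range(n_classes)]
    total = sum(freq.values())
    log_liks = []

    for _ in range(n_iter):
        # E-step: collect expected counts under the current parameters.
        cnt_c = [0.0] * n_classes
        cnt_v = [dict.fromkeys(verbs, 0.0) for _ in range(n_classes)]
        cnt_n = [dict.fromkeys(nouns, 0.0) for _ in range(n_classes)]
        log_lik = 0.0
        for (v, n), f in freq.items():
            joint = [p_c[c] * p_v[c][v] * p_n[c][n] for c in range(n_classes)]
            p_y = sum(joint)
            log_lik += f * math.log(p_y)
            for c in range(n_classes):
                expected = f * joint[c] / p_y  # f(y) * p(c | v, n)
                cnt_c[c] += expected
                cnt_v[c][v] += expected
                cnt_n[c][n] += expected
        log_liks.append(log_lik)

        # M-step: normalize the expected counts.
        for c in range(n_classes):
            if cnt_c[c] == 0.0:
                continue  # class received no mass; keep its old parameters
            p_c[c] = cnt_c[c] / total
            p_v[c] = {v: cnt_v[c][v] / cnt_c[c] for v in verbs}
            p_n[c] = {n: cnt_n[c][n] / cnt_c[c] for n in nouns}
    return p_c, p_v, p_n, log_liks


# Toy frequency table (invented for illustration).
toy_freq = {
    ("say.aso:s", "man"): 4,
    ("say.aso:s", "woman"): 3,
    ("rise.as:s", "price"): 5,
    ("rise.as:s", "rate"): 2,
    ("say.aso:s", "price"): 1,
}
p_c, p_v, p_n, log_liks = em_cluster(toy_freq, n_classes=2, n_iter=20)
print(p_c)
```

The returned `log_liks` list records the incomplete-data log-likelihood before each update, which makes the monotonicity property stated above easy to check on small inputs.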




Figure 3: Evaluation of pseudo-disambiguation 

Fig. 2 shows an induced semantic class out of
a model with 35 classes. At the top are listed
the 20 most probable nouns in the $p(n|5)$
distribution and their probabilities, and at left
are the 30 most probable verbs in the $p(v|5)$
distribution. 5 is the class index. Those
verb-noun pairs which were seen in the training
data appear with a dot in the class matrix. Verbs
with suffix .as:s indicate the subject slot of an
active intransitive. Similarly, .aso:s denotes the
subject slot of an active transitive, and .aso:o
denotes the object slot of an active transitive.
Thus $v$ in the above discussion actually consists
of a combination of a verb with a subcat frame
slot as:s, aso:s, or aso:o. Induced classes
often have a basis in lexical semantics; class
5 can be interpreted as clustering agents,
denoted by proper names, "man", and "woman",
together with verbs denoting communicative
action. Fig. 1 shows a cluster involving verbs of
scalar change and things which can move along
scales. Fig. 5 can be interpreted as involving
different dispositions and modes of their
execution.

3 Evaluation of Clustering Models 

3.1 Pseudo-Disambiguation 

We evaluated our clustering models on a 
pseudo-disambiguation task similar to that performed
in Pereira et al. (1993), but differing in
detail. The task is to judge which of two verbs 
v and v' is more likely to take a given noun n as 
its argument where the pair (v, n) has been cut 
out of the original corpus and the pair (v',n) is 
constructed by pairing n with a randomly cho- 
sen verb v' such that the combination (v',n) is 
completely unseen. Thus this test evaluates how 
well the models generalize over unseen verbs. 

The data for this test were built as follows. 
We constructed an evaluation corpus of (v, n, v') 
triples by randomly cutting a test corpus of 3000 
(v, n) pairs out of the original corpus of 1280712 
tokens, leaving a training corpus of 1178698 to- 
kens. Each noun n in the test corpus was com- 
bined with a verb v' which was randomly cho- 
sen according to its frequency such that the pair 
(v',n) appeared neither in the training nor in
the test corpus. However, the elements v, v', and 
n were required to be part of the training cor- 
pus. Furthermore, we restricted the verbs and 
nouns in the evaluation corpus to the ones which 
occurred at least 30 times and at most 3000 
times with some verb-functor v in the train- 
ing corpus. The resulting 1337 evaluation triples 
were used to evaluate a sequence of clustering 
models trained from the training corpus. 

The clustering models we evaluated were
parameterized in starting values of the training
algorithm, in the number of classes of the model,
and in the number of iteration steps, resulting
in a sequence of 3 × 10 × 6 models. Starting
from a lower bound of 50 % random choice,
accuracy was calculated as the number of times
the model decided for $p(n|v) > p(n|v')$ out of
all choices made. Fig. 3 shows the evaluation
results for models trained with 50 iterations,
averaged over starting values, and plotted against
class cardinality. Different starting values had
an effect of ± 2 % on the performance of the
test. We obtained a value of about 80 % accuracy
for models between 25 and 100 classes.
Models with more than 100 classes show a small
but stable overfitting effect.

Figure 4: Evaluation on smoothing task
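The decision rule of this test can be sketched as follows. The sketch assumes that $p(n|v)$ is obtained from the latent class model as $p(v,n)/p(v)$ with $p(v) = \sum_c p(c)\,p(v|c)$; the function names, the toy parameters, and the three evaluation triples are hypothetical and serve only to make the computation concrete.

```python
def p_noun_given_verb(n, v, p_c, p_v_given_c, p_n_given_c):
    """Return p(n|v) = p(v, n) / p(v) under a latent class model."""
    joint = sum(
        pc * pv.get(v, 0.0) * pn.get(n, 0.0)
        for pc, pv, pn in zip(p_c, p_v_given_c, p_n_given_c)
    )
    marginal = sum(pc * pv.get(v, 0.0) for pc, pv in zip(p_c, p_v_given_c))
    return joint / marginal if marginal > 0.0 else 0.0


def pseudo_disambiguation_accuracy(triples, p_c, p_v_given_c, p_n_given_c):
    """Fraction of (v, n, v') triples with p(n|v) > p(n|v')."""
    correct = sum(
        p_noun_given_verb(n, v, p_c, p_v_given_c, p_n_given_c)
        > p_noun_given_verb(n, v_alt, p_c, p_v_given_c, p_n_given_c)
        for v, n, v_alt in triples
    )
    return correct / len(triples)


# Toy model and evaluation triples (invented for illustration).
p_c = [0.6, 0.4]
p_v_given_c = [
    {"say.aso:s": 0.7, "rise.as:s": 0.3},
    {"say.aso:s": 0.1, "rise.as:s": 0.9},
]
p_n_given_c = [
    {"man": 0.8, "price": 0.2},
    {"man": 0.1, "price": 0.9},
]
triples = [
    ("rise.as:s", "price", "say.aso:s"),
    ("say.aso:s", "man", "rise.as:s"),
    ("rise.as:s", "man", "say.aso:s"),
]
accuracy = pseudo_disambiguation_accuracy(triples, p_c, p_v_given_c, p_n_given_c)
print(accuracy)
```

Ties are counted as errors in this sketch, which is one possible convention; the text above does not say how ties were handled.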

3.2 Smoothing Power 

A second experiment addressed the smoothing
power of the model by counting the number of
$(v, n)$ pairs in the set $V \times N$ of all
possible combinations of verbs and nouns which
received a positive joint probability by the
model. The $V \times N$-space for the above
clustering models included about 425 million
$(v,n)$ combinations; we approximated the
smoothing size of a model by randomly sampling
1000 pairs from $V \times N$ and returning the
percentage of positively assigned pairs in the
random sample. Fig. 4 plots the smoothing results
for the above models against the number of
classes. Starting values had an influence of
± 1 % on performance. Given the proportion of the
number of types in the training corpus to the
$V \times N$-space, without clustering we have a
smoothing power of 0.14 % whereas for example a
model with 50 classes and 50 iterations has a
smoothing power of about 93 %.
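The sampling estimate and the 0.14 % baseline can be sketched as follows. The function `smoothing_power`, its arguments, and the toy vocabulary are assumptions for illustration; only the two counts in the final computation (608850 types, about 425 million combinations) are taken from the text.

```python
import random


def smoothing_power(verbs, nouns, pair_prob, n_samples=1000, seed=0):
    """Percentage of randomly sampled (v, n) pairs with positive probability."""
    rng = random.Random(seed)
    positive = 0
    for _ in range(n_samples):
        v = rng.choice(verbs)
        n = rng.choice(nouns)
        if pair_prob(v, n) > 0.0:
            positive += 1
    return 100.0 * positive / n_samples


# Toy vocabulary and a set of seen pairs (invented for illustration).
verbs = ["say", "rise", "need", "cause"]
nouns = ["man", "woman", "price", "rate", "change"]
seen = {("say", "man"), ("rise", "price"), ("need", "change")}

# Without clustering, only pairs seen in training get positive probability.
unsmoothed = smoothing_power(verbs, nouns, lambda v, n: float((v, n) in seen))
print(unsmoothed)

# Baseline from the text: share of observed pair types in the V x N space.
baseline_percent = 608850 / 425e6 * 100
print(round(baseline_percent, 2))
```

Passing the smoothed pair probability of a trained clustering model as `pair_prob` gives the corresponding estimate for that model.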

Corresponding to the maximum likelihood 
paradigm, the number of training iterations had 
a decreasing effect on the smoothing perfor- 
mance whereas the accuracy of the pseudo- 
disambiguation was increasing in the number of 
iterations. We found a number of 50 iterations 
to be a good compromise in this trade-off. 



4 Lexicon Induction Based on 
Latent Classes 

The goal of the following experiment was to de- 
rive a lexicon of several hundred intransitive and 
transitive verbs with subcat slots labeled with 
latent classes. 

4.1 Probabilistic Labeling with Latent
Classes using EM-estimation

To induce latent classes for the subject slot of
a fixed intransitive verb the following
statistical inference step was performed. Given a
latent class model $p_{LC}(\cdot)$ for verb-noun
pairs, and a sample $n_1, \ldots, n_M$ of subjects
for a fixed intransitive verb, we calculate the
probability of an arbitrary subject $n \in N$ by:

$$p(n) = \sum_{c \in C} p(c,n) = \sum_{c \in C} p(c)\,p_{LC}(n|c)$$



The estimation of the parameter-vector
$\theta = \langle \theta_c \mid c \in C \rangle$ can
be formalized in the EM framework by viewing
$p(n)$ or $p(c,n)$ as a function of $\theta$ for
fixed $p_{LC}(\cdot)$. The re-estimation formulae
resulting from the incomplete data estimation for
these probability functions have the following
form ($f(n)$ is the frequency of $n$ in the sample
of subjects of the fixed verb):

$$M(\theta_c) = \frac{\sum_{n \in N} f(n)\,p_\theta(c|n)}{\sum_{n \in N} f(n)}$$
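The re-estimation of $\theta_c$ for one intransitive verb can be sketched as below. The function name `label_intransitive`, the uniform starting values, and the toy data are assumptions for illustration. Nouns to which the fixed LC model gives zero probability under every class are skipped, a choice made for this sketch only, so the code normalizes by the accumulated mass, which equals $\sum_n f(n)$ whenever no noun is skipped.

```python
def label_intransitive(freq, p_n_given_c, n_iter=50):
    """Estimate p(c) for one verb from subject counts freq[n], p_LC(n|c) fixed."""
    k = len(p_n_given_c)
    theta = [1.0 / k] * k  # uniform starting values (an assumption)
    for _ in range(n_iter):
        expected = [0.0] * k
        for n, f in freq.items():
            joint = [theta[c] * p_n_given_c[c].get(n, 0.0) for c in range(k)]
            p_n = sum(joint)
            if p_n == 0.0:
                continue  # noun carries no class evidence under the LC model
            for c in range(k):
                expected[c] += f * joint[c] / p_n  # f(n) * p(c | n)
        mass = sum(expected)
        if mass == 0.0:
            break
        theta = [e / mass for e in expected]
    return theta


# Fixed toy LC model and subject counts for one verb (invented).
p_n_given_c = [
    {"man": 0.8, "price": 0.2},
    {"man": 0.1, "price": 0.9},
]
subject_freq = {"price": 9, "man": 1}
theta = label_intransitive(subject_freq, p_n_given_c)
best_class = max(range(len(theta)), key=theta.__getitem__)
print(theta, best_class)
```

The most probable class under the estimated distribution serves as the label of the subject slot.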



A similar EM induction process can be applied
also to pairs of nouns, thus enabling induction of
latent semantic annotations for transitive verb
frames. Given a LC model $p_{LC}(\cdot)$ for
verb-noun pairs, and a sample
$(n_1, n_2)_1, \ldots, (n_1, n_2)_M$ of noun
arguments ($n_1$ subjects, and $n_2$ direct
objects) for a fixed transitive verb, we calculate
the probability of its noun argument pairs by:

$$p(n_1,n_2) = \sum_{c_1,c_2 \in C} p(c_1,c_2,n_1,n_2) = \sum_{c_1,c_2 \in C} p(c_1,c_2)\,p_{LC}(n_1|c_1)\,p_{LC}(n_2|c_2)$$

Again, estimation of the parameter-vector
$\theta = \langle \theta_{c_1 c_2} \mid c_1, c_2 \in C \rangle$
can be formalized in an EM framework by viewing
$p(n_1,n_2)$ or $p(c_1,c_2,n_1,n_2)$ as a function
of $\theta$ for fixed $p_{LC}(\cdot)$. The
re-estimation formulae resulting from this
incomplete data estimation problem have the
following simple form ($f(n_1,n_2)$ is the
frequency of $(n_1,n_2)$ in the sample of noun
argument pairs of the fixed verb):

$$M(\theta_{c_1 c_2}) = \frac{\sum_{n_1,n_2 \in N} f(n_1,n_2)\,p_\theta(c_1,c_2|n_1,n_2)}{\sum_{n_1,n_2 \in N} f(n_1,n_2)}$$

Figure 7: Scalar motion increase.
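The transitive case can be sketched in the same way, with the parameters now indexed by pairs of classes. The function `label_transitive`, the uniform initialization, and the toy counts are assumptions for illustration. As a simplification of this sketch, argument pairs with zero probability under the fixed LC model are skipped and the accumulated mass is used as the denominator.

```python
def label_transitive(freq, p_n_given_c, n_iter=50):
    """Estimate p(c1, c2) for one verb from counts freq[(n1, n2)]."""
    k = len(p_n_given_c)
    theta = [[1.0 / (k * k)] * k for _ in range(k)]  # uniform start (assumed)
    for _ in range(n_iter):
        expected = [[0.0] * k for _ in range(k)]
        for (n1, n2), f in freq.items():
            joint = [
                [
                    theta[c1][c2]
                    * p_n_given_c[c1].get(n1, 0.0)
                    * p_n_given_c[c2].get(n2, 0.0)
                    for c2 in range(k)
                ]
                for c1 in range(k)
            ]
            p_pair = sum(map(sum, joint))
            if p_pair == 0.0:
                continue  # pair carries no class evidence under the LC model
            for c1 in range(k):
                for c2 in range(k):
                    # f(n1, n2) * p(c1, c2 | n1, n2)
                    expected[c1][c2] += f * joint[c1][c2] / p_pair
        mass = sum(map(sum, expected))
        if mass == 0.0:
            break
        theta = [[e / mass for e in row] for row in expected]
    return theta


# Fixed toy LC model and (subject, object) counts for one verb (invented).
p_n_given_c = [
    {"man": 0.8, "price": 0.2},
    {"man": 0.1, "price": 0.9},
]
pair_freq = {("man", "price"): 8, ("man", "man"): 1, ("price", "price"): 1}
theta = label_transitive(pair_freq, p_n_given_c)
best_pair = max(
    ((c1, c2) for c1 in range(2) for c2 in range(2)),
    key=lambda p: theta[p[0]][p[1]],
)
print(best_pair)
```

The most probable pair $(c_1, c_2)$ labels the subject and object slots of the verb jointly.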



Note that the class distributions $p(c)$ and
$p(c_1, c_2)$ for intransitive and transitive
models can be computed also for verbs unseen in
the LC model.

4.2 Lexicon Induction Experiment 

Experiments used a model with 35 classes. From
maximal probability parses for the British
National Corpus derived with a statistical parser
(Carroll and Rooth, 1998), we extracted
frequency tables for intransitive verb/subject
pairs and transitive verb/subject/object triples.
The 500 most frequent verbs were selected for slot
labeling. Fig. 6 shows two verbs $v$ for which
the most probable class label is 5, a class
which we earlier described as communicative
action, together with the estimated frequencies of
$f(n)\,p_\theta(c|n)$ for those ten nouns $n$ for
which this estimated frequency is highest.

Fig. 7 shows corresponding data for an
intransitive scalar motion sense of increase.

Fig. 8 shows the intransitive verbs which take
17 as the most probable label. Intuitively, the
verbs are semantically coherent. When compared
to the 48 top-level verb classes of Levin (1993),
we found an agreement of our classification with
her class of "verbs of changes of state" except
for the last three verbs in the list in Fig. 8,
which is sorted by probability of the class label.

Figure 5: Class 8: dispositions
(Class probability 0.0369. Most probable verbs
with their probabilities, listed once per slot and
all with an aso frame unless marked: require
0.0539, show 0.0469, need 0.0439, involve 0.0383,
produce 0.0270, occur.as 0.0255, cause 0.0192,
cause 0.0189, affect 0.0179, require 0.0162, mean
0.0150, suggest 0.0140, produce 0.0138, demand
0.0109, reduce 0.0109, reflect 0.0097, involve
0.0092, undergo 0.0091.)

Similar results for German intransitive scalar
motion verbs are shown in Fig. 9. The data
for these experiments were extracted from the
maximal-probability parses of a 4.1 million word
corpus of German subordinate clauses, yielding
418290 tokens (318086 types) of pairs of
verbs or adjectives and nouns. The lexicalized
probabilistic grammar for German used is
described in Beil et al. (1999). We compared
the German example of scalar motion verbs to
the linguistic classification of verbs given by
Schuhmacher (1986) and found an agreement
of our classification with the class of "einfache
Änderungsverben" (simple verbs of change)
except for the verbs anwachsen (increase) and
stagnieren (stagnate) which were not classified
there at all.



Fig. 10 shows the most probable pair of
classes for increase as a transitive verb, together
with estimated frequencies for the head filler
pair. Note that the object label 17 is the class
found with intransitive scalar motion verbs; this
correspondence is exploited in the next section.

Figure 8: Scalar motion verbs

Figure 9: German intransitive scalar motion
verbs

5 Linguistic Interpretation 

In some linguistic accounts, multi-place verbs
are decomposed into representations involving
(at least) one predicate or relation per
argument. For instance, the transitive
causative/inchoative verb increase is composed
of an actor/causative verb combining with a
one-place predicate in the structure on the
left in Fig. 11. Linguistically, such
representations are motivated by argument
alternations (diathesis), case linking and deep
word order, language acquisition, scope ambiguity,
by the desire to represent aspects of lexical
meaning, and by the fact that in some languages,
the postulated decomposed representations are
overt, with each primitive predicate corresponding
to a morpheme. For references and recent
discussion of this kind of theory see Hale and
Keyser (1993) and Kural (1996).

Figure 10: Transitive increase with estimated
frequencies for filler pairs.

Figure 11: First tree: linguistic lexical entry for
transitive verb increase. Second: corresponding
lexical entry with induced classes as relational
constants. Third: indexed open class root added
as conjunct in transitive scalar motion increase.
Fourth: induced entry for related intransitive
increase.

We will sketch an understanding of the lexical
representations induced by latent-class labeling
in terms of the linguistic theories mentioned
above, aiming at an interpretation which
combines computational learnability, linguistic
motivation, and denotational-semantic adequacy.
The basic idea is that latent classes are
computational models of the atomic relation
symbols occurring in lexical-semantic
representations. As a first implementation,
consider replacing the relation symbols in the
first tree in Fig. 11 with relation symbols
derived from the latent class labeling. In the
second tree in Fig. 11, $R_{17}$ and $R_8$ are
relation symbols with indices derived from the
labeling procedure of Sect. 4. Such
representations can be semantically interpreted
in standard ways, for instance by interpreting
relation symbols as denoting relations
between events and individuals.

Such representations are semantically inad- 
equate for reasons given in philosophical cri- 
tiques of decomposed linguistic representations; 
see Fodor (1998) for recent discussion. A lexicon
estimated in the above way has as many
primitive relations as there are latent classes. 
We guess there should be a few hundred classes 
in an approximately complete lexicon (which 
would have to be estimated from a corpus of 
hundreds of millions of words or more). Fodor's
arguments, which are based on the very lim- 
ited degree of genuine interdefinability of lexical 
items and on Putnam's arguments for contex- 
tual determination of lexical meaning, indicate 
that the number of basic concepts has the or- 
der of magnitude of the lexicon itself. More con- 
cretely, a lexicon constructed along the above 
principles would identify verbs which are la- 
belled with the same latent classes; for instance 
it might identify the representations of grab and 
touch. 

For these reasons, a semantically adequate 
lexicon must include additional relational con- 
stants. We meet this requirement in a simple 
way, by including as a conjunct a unique constant
derived from the open-class root, as in
the third tree in Fig. 11. We introduce indexing
of the open class root (copied from the class
index) in order that homophony of open class 
roots not result in common conjuncts in seman- 
tic representations — for instance, we don't want 
the two senses of decline exemplified in decline 
the proposal and decline five percent to have a 
common entailment represented by a common 
conjunct. This indexing method works as long 
as the labeling process produces different latent 
class labels for the different senses. 



The last tree in Fig. 11 is the learned repre- 
sentation for the scalar motion sense of the in- 
transitive verb increase. In our approach, learn- 
ing the argument alternation (diathesis) relat- 
ing the transitive increase (in its scalar mo- 
tion sense) to the intransitive increase (in its 
scalar motion sense) amounts to learning
representations with a common component
$R_{17} \wedge \mathit{increase}_{17}$. In this
case, this is achieved.

6 Conclusion 

We have proposed a procedure which maps 
observations of subcategorization frames with 
their complement fillers to structured lexical 
entries. We believe the method is scientifically 
interesting, practically useful, and flexible be- 
cause: 

1. The algorithms and implementation are ef- 
ficient enough to map a corpus of a hundred 
million words to a lexicon. 

2. The model and induction algorithm have 
foundations in the theory of parameter- 
ized families of probability distributions 
and statistical estimation. As exemplified 
in the paper, learning, disambiguation, and 
evaluation can be given simple, motivated 
formulations. 

3. The derived lexical representations are lin- 
guistically interpretable. This suggests the 
possibility of large-scale modeling and ob- 
servational experiments bearing on ques- 
tions arising in linguistic theories of the lex- 
icon. 

4. Because a simple probabilistic model is 
used, the induced lexical entries could be 
incorporated in lexicalized syntax-based 
probabilistic language models, in particular 
in head-lexicalized models. This provides 
for potential application in many areas. 

5. The method is applicable to any natural 
language where text samples of sufficient 
size, computational morphology, and a ro- 
bust parser capable of extracting subcate- 
gorization frames with their fillers are avail- 
able. 